[WIP] Feature Analytics: Add Data Analyzer for pre-training graph data analysis#591
Open
Co-Authored-By: shubhamvij <svij@snapchat.com>
Co-Authored-By: shubhamvij <shubhamvij@users.noreply.github.com>
Co-Authored-By: shubhamvij <svij@snapchat.com>
…sisResult, FeatureProfileResult) Co-Authored-By: shubhamvij <svij@snapchat.com>
Co-Authored-By: shubhamvij <svij@snapchat.com>
Implements the orchestration layer for BQ-based graph data quality checks:
- Tier 1 hard-fails (dangling edges, referential integrity, duplicate nodes) raise DataQualityError carrying a partially populated result.
- Tier 2 core metrics (counts, degree stats, top-K hubs, INT16 clamp, NULL rates) plus Python-side feature memory and neighbor-explosion estimates.
- Tier 3 label/heterogeneous checks auto-enabled by config (label_column presence; multiple edge tables).
- Tier 4 opt-in placeholders (power-law exponent from degree stats).

Co-Authored-By: shubhamvij <svij@snapchat.com>
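The Tier 1 hard-fail behaviour described above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: `GraphCheckResult`, its fields, `run_checks`, and the shape of the `DataQualityError` constructor are all assumptions made for the example. Only the idea (raise an error that still carries whatever was computed so far) comes from the commit message.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class GraphCheckResult:
    """Hypothetical result container; a field stays None until its check has run."""

    dangling_edge_count: Optional[int] = None
    duplicate_node_count: Optional[int] = None
    node_count: Optional[int] = None  # stands in for a Tier 2 metric


class DataQualityError(Exception):
    """Hard failure that still exposes the partially populated result."""

    def __init__(self, message: str, partial_result: GraphCheckResult) -> None:
        super().__init__(message)
        self.partial_result = partial_result


def run_checks(tier1_counts: Dict[str, int], node_count: int) -> GraphCheckResult:
    """Run Tier 1 first; only continue to later tiers if nothing hard-failed."""
    result = GraphCheckResult()
    result.dangling_edge_count = tier1_counts["dangling_edges"]
    result.duplicate_node_count = tier1_counts["duplicate_nodes"]

    failures: List[str] = [name for name, count in tier1_counts.items() if count > 0]
    if failures:
        # Attach what we have so a caller can still report the Tier 1 findings.
        message = "Tier 1 checks failed: " + ", ".join(sorted(failures))
        raise DataQualityError(message, result)

    result.node_count = node_count  # Tier 2 is reached only on clean data
    return result
```

A caller can catch `DataQualityError` and inspect `err.partial_result` to see which Tier 1 counts were non-zero, while later-tier fields remain `None`.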
Co-Authored-By: shubhamvij <svij@snapchat.com>
…assets Co-Authored-By: shubhamvij <svij@snapchat.com>
Implements the report_generator module that stitches AI-owned template,
styles, and chart JS into a single self-contained HTML report by
replacing the four INJECT_* placeholders. Adds a golden-file snapshot
test (and four structural tests) so future AI-driven edits to the
report assets fail fast until the snapshot is regenerated. Registers
the *.ai.{html,js,css} assets as package-data so importlib.resources
can resolve them from an installed wheel.
Co-Authored-By: shubhamvij <svij@snapchat.com>
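As a rough sketch of the placeholder-stitching step described in the commit above: the function below replaces `INJECT_*` tokens in a template in a single pass and fails loudly when the template and the supplied assets disagree. The function name `stitch_report` and the placeholder names used in the usage note are hypothetical; the commit message says only that there are four `INJECT_*` placeholders and does not name them.

```python
import re
from typing import Dict

# Assumed token shape: "INJECT_" followed by upper-case letters, digits, or underscores.
_PLACEHOLDER = re.compile(r"INJECT_[A-Z0-9_]+")


def stitch_report(template: str, injections: Dict[str, str]) -> str:
    """Return the template with every INJECT_* token replaced by its asset text."""
    found = set(_PLACEHOLDER.findall(template))
    expected = set(injections)

    missing = sorted(found - expected)
    if missing:
        raise ValueError(f"no content supplied for placeholders: {missing}")

    unused = sorted(expected - found)
    if unused:
        raise ValueError(f"placeholders not present in template: {unused}")

    # A single re.sub pass means injected content is never re-scanned, so an
    # asset that happens to contain the text "INJECT_..." is left untouched.
    return _PLACEHOLDER.sub(lambda match: injections[match.group(0)], template)
```

For example, `stitch_report("<style>INJECT_STYLES</style>", {"INJECT_STYLES": "body{}"})` yields `<style>body{}</style>`. A golden-file test would then compare the stitched output against a checked-in snapshot.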
Implements the main orchestrator class that coordinates graph structure analysis, feature profiling, and HTML report generation. Includes CLI entry point with argparse for analyzer_config_uri and resource_config_uri.

Co-Authored-By: shubhamvij <svij@snapchat.com>
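A minimal sketch of what such an argparse entry point might look like. The two argument names come from the commit message; the `--` flag spelling, the `required`/optional split, the help strings, and the `build_parser` name are assumptions for illustration only.

```python
import argparse
from typing import List, Optional


def build_parser() -> argparse.ArgumentParser:
    """Build a parser for the two config URIs named in the commit message."""
    parser = argparse.ArgumentParser(description="Pre-training graph data analyzer")
    parser.add_argument(
        "--analyzer_config_uri",
        required=True,  # assumption: the analyzer cannot run without its YAML config
        help="URI of the analyzer YAML config",
    )
    parser.add_argument(
        "--resource_config_uri",
        default=None,  # assumption: treated as optional in this sketch
        help="URI of the resource config",
    )
    return parser


def parse_cli(argv: Optional[List[str]] = None) -> argparse.Namespace:
    # argv=None makes argparse read sys.argv[1:], which is the normal CLI path.
    return build_parser().parse_args(argv)
```

Passing an explicit list, as in `parse_cli(["--analyzer_config_uri", "gs://bucket/config.yaml"])`, keeps the parser easy to unit test without touching `sys.argv`.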
…deferred) Co-Authored-By: shubhamvij <svij@snapchat.com>
Narrows the Union return type for mypy in the direct-merge test path.

Co-Authored-By: shubhamvij <svij@snapchat.com>
Co-Authored-By: shubhamvij <svij@snapchat.com>
Sits alongside SPEC.md to separate product requirements (why and what) from technical implementation contract (how). Both are AI-owned and together form the input for regenerating report.ai.html, charts.ai.js, and styles.ai.css.

Co-Authored-By: shubhamvij <svij@snapchat.com>
… 1-pager, engineering spec

Colocates all planning docs for the BQ Data Analyzer feature:
- 20260415-bq-data-analyzer.md: full design doc with 4-tier validation, cost control, tradeoff analysis
- 20260415-bq-data-analyzer-references.md: literature review of 18 production GNN papers with 100+ findings, common themes, and consolidated threshold table
- 20260416-data-analyzer-1-pager.md: executive summary for peer engineers and decision makers
- 20260416-data-analyzer-engineering-spec.md: per-layer implementation plan that the analyzer code in this branch follows

Co-Authored-By: shubhamvij <svij@snapchat.com>
…trator

Previously the orchestrator generated the HTML in memory but left the upload as a TODO, forcing practitioners to copy a Python snippet to see the output. Now DataAnalyzer.run() writes report.html under config.output_gcs_path, detecting the scheme:
- gs:// URIs upload via GcsUtils.upload_from_string()
- local paths write via pathlib, creating parent dirs as needed

Returns the final path (GCS URI or resolved local path) so the CLI can log it and practitioners can open the file directly. Tests cover both local and mocked-GCS paths plus trailing-slash handling.

Co-Authored-By: shubhamvij <svij@snapchat.com>
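The scheme detection described above could look roughly like the sketch below. It is not the code from this PR: `write_report` is a hypothetical helper, and the GCS upload is passed in as a plain callable because the real signature of `GcsUtils.upload_from_string()` is not shown here.

```python
from pathlib import Path
from typing import Callable

REPORT_FILENAME = "report.html"


def write_report(html: str, output_path: str, upload: Callable[[str, str], None]) -> str:
    """Write the report under output_path and return where it ended up.

    `upload(uri, content)` is a stand-in for whatever GCS helper is available.
    """
    if output_path.startswith("gs://"):
        # Strip trailing slashes so "gs://b/dir/" and "gs://b/dir" give the same URI.
        uri = f"{output_path.rstrip('/')}/{REPORT_FILENAME}"
        upload(uri, html)
        return uri

    # pathlib already ignores a trailing slash on local paths.
    target = Path(output_path) / REPORT_FILENAME
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_text(html, encoding="utf-8")
    return str(target.resolve())
```

Injecting the uploader keeps the local branch testable with a temporary directory and lets the GCS branch be exercised with a simple fake instead of a patched client.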
Quickstart-first guide at gigl/analytics/README.md covering:
- 3-step quickstart (auth, YAML config, CLI command) with a single entry point that now writes report.html to disk or GCS
- Tier summary table (what runs when)
- Interpretation table with thresholds + "what to do" actions drawn from the 18-paper literature review
- Advanced config keys (opt-in Tier 3/4, label_column, timestamp_column, fan_out)
- Python API snippet for programmatic access
- graph_validation sub-package pointer
- Scope and limitations (FeatureProfiler stub, Tier 4 queries TODO)
- Links to design doc, literature review, 1-pager, engineering spec, report PRD, and report SPEC

Co-Authored-By: shubhamvij <svij@snapchat.com>
Changes from the review pass:

README fixes:
- Remove all docs/plans/* links (the plans were intentionally deleted in d3f1eb8). Inline the relevant paper citations into the threshold table so readers aren't pointed at 404s.
- Add "Prerequisites" line pointing at the GiGL installation guide so the quickstart doesn't assume uv/deps are already set up.
- Mark Tier 4 flags (compute_homophily, compute_connected_components, compute_clustering, timestamp_column) as not-yet-implemented in both the tier table and the Advanced Config section, not only in the Scope section at the bottom.
- Add the power-law exponent mention to the Tier 4 row (was only in scope notes; it's actually computed today).
- Document the heterogeneous-graph referential-integrity caveat (analyzer currently joins each edge table against node_tables[0]).
- Link to tests/test_assets/analytics/golden_report.html so a reader can preview the output before authenticating to BQ.

Config fix:
- NodeTableSpec.feature_columns: MISSING -> field(default_factory=list) so that nodes with no features are legal. Previously users got a cryptic OmegaConf MissingMandatoryValue error, and no-feature nodes are a real use case.
- Add a regression test covering the no-feature-columns case.

All 31 analytics unit tests pass. mypy clean. check_format clean.

Co-Authored-By: shubhamvij <svij@snapchat.com>
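To illustrate the config fix, here is a minimal sketch using plain dataclasses. Only the `feature_columns` field and its `field(default_factory=list)` default come from the commit message; the other two fields are hypothetical stand-ins, and the real class is presumably loaded through OmegaConf rather than constructed directly.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class NodeTableSpec:
    # Hypothetical required fields, included only so the example is complete.
    bq_table: str
    id_column: str
    # default_factory gives each instance its own empty list, so a node table
    # with no features is valid and instances never share one mutable default.
    feature_columns: List[str] = field(default_factory=list)


featureless = NodeTableSpec(bq_table="project.dataset.users", id_column="user_id")
with_features = NodeTableSpec(
    bq_table="project.dataset.items",
    id_column="item_id",
    feature_columns=["price", "category"],
)
```

A bare `= []` default is rejected by `dataclasses` for exactly the shared-mutable-state reason, which is why `default_factory` is the idiomatic way to make the field optional.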
Summary
DataAnalyzer module that takes a YAML config pointing at BQ node/edge tables and generates a single self-contained HTML report covering data quality, feature distributions, and graph structure, so engineers can diagnose training data issues in minutes instead of after a failed training run.

Changes
- gigl/analytics/data_analyzer/: config.py, types.py, queries.py (18 SQL templates), graph_structure_analyzer.py, feature_profiler.py (stub), data_analyzer.py orchestrator + CLI
- gigl/analytics/data_analyzer/report/: PRD.md, SPEC.md, report_generator.py, and AI-owned report.ai.html, charts.ai.js, styles.ai.css (regenerable from PRD + SPEC)
- tests/unit/analytics/data_analyzer/: 26 unit tests covering config parsing, SQL templates, analyzer orchestration, and HTML snapshot
- tests/test_assets/analytics/: sample_analyzer_config.yaml + golden_report.html snapshot
- docs/plans/: design doc, literature review, 1-pager, engineering spec (all colocated)
- pyproject.toml: package-data declaration so .ai.* assets ship in installed wheels

Test plan
- uv run python -m unittest discover -s tests/unit/analytics -p "*_test.py" -t . → 26/26 pass
- make type_check → clean on 651 files
- make check_format → clean

v1 scope cuts (follow-up PRs)
- GenerateAndVisualizeStats, IngestRawFeatures, init_beam_pipeline_options from the existing DataPreprocessor) will land in a follow-up PR.

Docs
- docs/plans/20260415-bq-data-analyzer.md
- docs/plans/20260415-bq-data-analyzer-references.md
- docs/plans/20260416-data-analyzer-1-pager.md
- docs/plans/20260416-data-analyzer-engineering-spec.md
- gigl/analytics/data_analyzer/report/PRD.md
- gigl/analytics/data_analyzer/report/SPEC.md